INTRODUCTION

The aim of this paper is to investigate some combinatorial aspects of written
language within the framework determined by the well-known game of crossword
puzzles. Various types of probabilistic regularities appearing in such puzzles reveal some
hidden, little-known restrictions operating in the field of natural languages. Most of
the restrictions of this type are similar in each natural language. Our direct concern will
be the Romanian language.

Our research may have some relevance for the phono-statistics of Romanian. The
distribution of phonemes and letters is established for a corpus whose morphological
structure deviates from that of the standard language. Another aspect of our
research may be related to the so-called tabular reading in poetry. The horizontal-vertical
correlation considered in the first part of the paper offers some suggestions
concerning a bi-dimensional investigation of the poetic sign.

Our investigation is concerned with the Romanian crossword puzzles published in
[4]. Various concepts concerning crossword puzzles are borrowed from N. Andrei [3].
Mathematical linguistic concepts are borrowed from S. Marcus [1] and from S. Marcus, E.
Nicolau, S. Stati [2].



SECTION I. THE GRID

§1. MATHEMATICAL RESEARCHES ON GRIDS 

It is known that a word in a grid is bounded on the left and on the right either by a
black point or by the border of the grid.

We will take into account the words consisting of one letter (though they are not
clued in the Rebus), those of two letters (even if they have no meaning, e.g. NT, RU, ...),
and those of three or more letters, even if they belong to the category of rare words
(foreign localities, rivers, abbreviations, etc.) which are not found in the Romanian
Language Dictionary (see [3], pp. 82-307, “Rebus glossary”).

The grids have both across and down words.

We divide the grid into 3 zones: 

a) the four corners of the grid (zone A)

b) the grid border, without the four corners (zone B)

c) the middle of the grid (zone C)

We assume that the grid has n lines, m columns, and p black points. 

Then: 

Proposition 1. The overall number of words (across and down) in the grid is equal
to n + m + pNB + 2 · pNC , where

pNB = the number of black points in zone B ,

pNC = the number of black points in zone C .

Proof: We consider initially the grid without any black points. Then it has
n + m words.

- If we put a black point in zone A , the number of words stays the same. (So it does not
matter how many black points are found in zone A .)

- If we put a black point in zone B , e.g. on line 1 and column j , 1 < j < m , the
number of words increases by one (because on line 1 two words are now formed where before
there was only one, while on column j there is still one word). The case is analogous if we
put a black point on column 1 and line i , 1 < i < n (the grid may be transposed: the
horizontal lines become the vertical ones and vice versa). Thus, each black point in zone B
adds one word to the overall number of words in the grid.

- If we put a black point in zone C , say on line i , 1 < i < n , and column j ,
1 < j < m , then the number of words increases by two: both on line i and on column j two
words now appear where before there was only one word on each of them. Thus, each black
point in zone C adds two words to the overall number of words in the grid.

From this proof it follows:

Corollary 1. The minimum number of words of an n x m grid is n + m . This
minimum is achieved when there are no black points in zones B and C .

Corollary 2. The maximum number of words of an n x m grid having p black points
is n + m + 2p , and it is achieved when all p black points are found in zone C .

Corollary 3. An n x m grid having p black points has a minimum number
of words when the black points are placed first in zone A , then in zone B (alternately,
because two or more black points may not be adjacent), and the rest in
zone C .
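The count in Proposition 1 can be checked on a small example. The sketch below is only an illustration: the 5 x 5 grid, the '#' encoding of black points and the helper names `count_words` and `zone_counts` are assumptions made here, not taken from the paper. It also assumes, as the proof implicitly does, that no two black points are adjacent.

```python
def count_words(grid):
    """Count the words of a grid given as a list of equal-length strings.

    '#' marks a black point (an assumed encoding). A word is a maximal run
    of white squares in a line or a column; one-letter runs count as words,
    as in the text.
    """
    def runs(cells):
        # A run starts at every white square whose predecessor is black or missing.
        return sum(1 for k, c in enumerate(cells)
                   if c != '#' and (k == 0 or cells[k - 1] == '#'))

    across = sum(runs(row) for row in grid)
    down = sum(runs(col) for col in zip(*grid))
    return across + down


def zone_counts(grid):
    """Return (pNB, pNC): black points on the border (corners excluded) and inside."""
    n, m = len(grid), len(grid[0])
    p_nb = p_nc = 0
    for i, row in enumerate(grid):
        for j, c in enumerate(row):
            if c != '#':
                continue
            on_horizontal_border = i in (0, n - 1)
            on_vertical_border = j in (0, m - 1)
            if on_horizontal_border and on_vertical_border:
                continue            # zone A: does not change the count
            if on_horizontal_border or on_vertical_border:
                p_nb += 1           # zone B
            else:
                p_nc += 1           # zone C
    return p_nb, p_nc


# Hypothetical grid: one black point in zone A, one in zone B, two in zone C.
GRID = [
    "#....",
    "..#..",
    "....#",
    ".#...",
    ".....",
]
n, m = len(GRID), len(GRID[0])
p_nb, p_nc = zone_counts(GRID)
formula = n + m + p_nb + 2 * p_nc
counted = count_words(GRID)
print(counted, formula)  # 15 15
```

If two black points are placed next to each other, the direct count falls below the formula, which is why the non-adjacency assumption matters.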



Proposition 2. The difference between the number of words on the horizontal and
on the vertical of an n x m grid is n - m + pNBO - pNBV , where

pNBO = the number of black points in zone BO ;

pNBV = the number of black points in zone BV .

We divide zone B into two parts:

- zone BO = the horizontal part of zone B (lines 1 and n )

- zone BV = the vertical part of zone B (columns 1 and m ).

The proof of this proposition follows the previous one and uses its results.

If we do not have any black points in the grid, the difference between the number of
words on the horizontal and on the vertical is n - m .

- If we have a black point in zone A , the difference does not change. The same
holds for zone C .

- If we have a black point in zone BO , the difference becomes n - m + 1 ; a black
point in zone BV makes it n - m - 1 .

From Proposition 2 it follows:

Proposition 3. An n x m grid has n + pNBO + pNC words on the horizontal and
m + pNBV + pNC words on the vertical.

There are two methods of proof. The first follows the reasoning used for Propositions 1
and 2.

The second computes the numbers of across and down words directly from Propositions 1
and 2, since their sum (Proposition 1) and their difference (Proposition 2) are
known.
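As a rough check of the across and down counts, the sketch below cuts every line and column at the black points and compares the result with the two expressions of Proposition 3. The 4 x 6 grid, the '#' encoding and the function names are hypothetical choices for illustration; no two black points are adjacent.

```python
def across_down(grid):
    """Return (across, down) word counts; '#' marks a black point (assumed encoding)."""
    # Words are the non-empty pieces left after cutting at the black points.
    across = sum(len([w for w in row.split('#') if w]) for row in grid)
    down = sum(len([w for w in ''.join(col).split('#') if w])
               for col in zip(*grid))
    return across, down


def border_counts(grid):
    """Return (pNBO, pNBV, pNC) for a grid with at least 2 lines and 2 columns."""
    n, m = len(grid), len(grid[0])
    # Zone BO: lines 1 and n without the corners.
    p_nbo = sum(row[j] == '#' for row in (grid[0], grid[n - 1])
                for j in range(1, m - 1))
    # Zone BV: columns 1 and m without the corners.
    p_nbv = sum(grid[i][j] == '#' for i in range(1, n - 1) for j in (0, m - 1))
    # Zone C: everything strictly inside the border.
    p_nc = sum(grid[i][j] == '#' for i in range(1, n - 1)
               for j in range(1, m - 1))
    return p_nbo, p_nbv, p_nc


# Hypothetical 4 x 6 grid: one black point in BO, one in BV, two in zone C.
GRID_4X6 = [
    "..#...",
    ".....#",
    ".#..#.",
    "......",
]
n, m = len(GRID_4X6), len(GRID_4X6[0])
across, down = across_down(GRID_4X6)
p_nbo, p_nbv, p_nc = border_counts(GRID_4X6)
print(across, n + p_nbo + p_nc)   # 7 7
print(down, m + p_nbv + p_nc)     # 9 9
print(across - down, n - m + p_nbo - p_nbv)  # -2 -2
```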

Proposition 4. The mean length of the words (= number of letters) of an n x m grid
with p black points is at least

$$\frac{2(nm - p)}{n + m + 2p}.$$

Indeed, the maximum number of words is n + m + 2p , the number of letters is
nm - p , and each letter is included in two words: one across and one down.

A grid is the more crossed, the smaller its number of one- and two-letter words
and of black points (assuming that it meets the other known restrictions). In
Romanian grids the black points make up at most 15% of the total number of squares
(the value being rounded to the nearest integer, e.g. 15% of a 13 x 13 grid is
25.35 ≈ 25, and of a 12 x 12 grid is 21.6 ≈ 22). So, in the previous
properties, for n x m grids with p black points we replace p by
$\left[\frac{3}{20}\, nm\right]$, where

$$[x] = \max\{a \in \mathbb{N} : |a - x| \le 0.5\}.$$
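To make the bound concrete, the following sketch evaluates it for the two grid sizes mentioned in the text, taking the number of black points to be 15% of the squares rounded to the nearest integer. The function names are illustrative assumptions; exact fractions are used only to avoid floating-point rounding in the intermediate steps.

```python
from fractions import Fraction


def max_black_points(n, m):
    # 15% = 3/20 of the n*m squares, rounded to the nearest integer
    # (halves rounded up), done in integer arithmetic.
    return (3 * n * m + 10) // 20


def mean_length_lower_bound(n, m, p):
    # Lower bound 2(nm - p) / (n + m + 2p) on the mean word length.
    return Fraction(2 * (n * m - p), n + m + 2 * p)


for n, m in [(13, 13), (12, 12)]:
    p = max_black_points(n, m)
    bound = mean_length_lower_bound(n, m, p)
    print(f"{n} x {m}: p = {p}, mean word length >= {float(bound):.3f}")
# 13 x 13: p = 25, mean word length >= 3.789
# 12 x 12: p = 22, mean word length >= 3.588
```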
§2. STATISTICAL RESEARCHES ON GRIDS



In [1] we find the notion of the “ecart of a sound x”, denoted by $\alpha(x)$, which equals the
difference between the rank of x in Romanian and the rank of x in the analyzed text.

We extend this notion to that of the ecart of a text t , denoted by $\alpha(t)$:

$$\alpha(t) = \frac{1}{n} \sum_{i=1}^{n} |\alpha(A_i)|,$$

where $\alpha(A_i)$ is the ecart of the sound $A_i$ (as in [1]) and n is the number of distinct
sounds in the text t . (If there are letters of the alphabet which are not found in the
analyzed text, they are entered in the frequency table with the last ranks.)

Proposition 1. We have the double inequality

$$0 \le \alpha(t) \le \frac{n-1}{2} + \frac{1}{n}\left[\frac{n}{2}\right],$$

where [y] represents the integer part of the real number y .

Indeed, the first inequality is evident. Let

$$\Phi = \begin{pmatrix} 1 & 2 & \dots & n \\ j_1 & j_2 & \dots & j_n \end{pmatrix}. \qquad \text{Then } \alpha(t) = \frac{1}{n} \sum_{i=1}^{n} |i - j_i|.$$

This permutation constitutes a mathematical model of the two frequency tables of
sounds: in Romanian (the first line) and in the text t (the second line).

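For the permutation

$$\psi = \begin{pmatrix} 1 & 2 & \dots & n-1 & n \\ n & n-1 & \dots & 2 & 1 \end{pmatrix}$$

we have

$$\sum_{i=1}^{n} |i - \psi(i)| = 2[(n-1) + (n-3) + (n-5) + \dots] = 2\sum_{k=1}^{[n/2]} (n - 2k + 1) = 2\left[\frac{n}{2}\right]\left(n - \left[\frac{n}{2}\right]\right) = \frac{n(n-1)}{2} + \left[\frac{n}{2}\right],$$

whence $\alpha(t) = \frac{n-1}{2} + \frac{1}{n}\left[\frac{n}{2}\right]$.

By induction with respect to $n \ge 2$, we now prove that the sum $S = \sum_{i=1}^{n} |i - j_i|$ has its
maximum value for the permutation $\psi$.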

For n = 2 and n = 3 this is easily checked directly. Suppose the assertion is true for
all values less than n + 2 . Let us prove it for n + 2 :

$$\psi = \begin{pmatrix} 1 & 2 & \dots & n+1 & n+2 \\ n+2 & n+1 & \dots & 2 & 1 \end{pmatrix}.$$

Removing the first and the last column, we obtain

$$\psi' = \begin{pmatrix} 2 & \dots & n+1 \\ n+1 & \dots & 2 \end{pmatrix},$$

which is a permutation of n elements and for which S has the same value as for the
permutation

$$\psi'' = \begin{pmatrix} 1 & \dots & n \\ n & \dots & 1 \end{pmatrix},$$

i.e. the maximum value ($\psi''$ was obtained from $\psi'$ by diminishing each element by one).

The permutation of 2 elements

$$\eta = \begin{pmatrix} 1 & n+2 \\ n+2 & 1 \end{pmatrix}$$

gives the maximum value for S .

But $\psi$ is obtained from $\psi'$ and $\eta$:

$$\psi(i) = \begin{cases} \psi'(i), & \text{if } i \notin \{1, n+2\}, \\ \eta(i), & \text{otherwise.} \end{cases}$$
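The upper bound can also be checked by brute force for small n. The sketch below is not part of the proof: it simply enumerates all permutations of 1..n, takes the largest value of the sum of |i - j_i|, and compares it with the value given by the reversed permutation and with the closed form n(n-1)/2 + [n/2]. The function names are illustrative assumptions.

```python
from itertools import permutations


def max_displacement(n):
    # Largest value of S = sum |i - j_i| over all permutations (j_1, ..., j_n) of 1..n.
    return max(sum(abs(i - j) for i, j in enumerate(perm, start=1))
               for perm in permutations(range(1, n + 1)))


def reversed_displacement(n):
    # Value of S for the reversed permutation i -> n + 1 - i.
    return sum(abs(i - (n + 1 - i)) for i in range(1, n + 1))


for n in range(2, 8):
    closed_form = n * (n - 1) // 2 + n // 2
    assert max_displacement(n) == reversed_displacement(n) == closed_form
    # Dividing by n gives the largest possible ecart of a text with n sounds.
    print(n, closed_form, closed_form / n)
```

The enumeration grows as n!, so this is only practical for very small n; it confirms the statement for n = 2, ..., 7 and nothing more.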



Remark: The bigger the ecart of a text, the bigger its “angle of deviation” from the
usual language.

It would be interesting to calculate, for example, the ecart of a poem.

The notion of ecart could then be extended even further:

a) the ecart of a word, equal to the difference between the rank of the word in the
language and its rank in the text;

b) the ecart of a text (with respect to words):

$$\alpha_c(t) = \frac{1}{n} \sum_{i=1}^{n} |\alpha_c(a_i)|,$$

where $\alpha_c(a_i)$ is the ecart of the word $a_i$, and n is the number of distinct words in the text t .



* 



We give below some rebus statistical data. By examining 150 grids [4] we obtain
the following results:

Occurrence frequency of words in the grid, depending on their length (in letters)

Letter order   Letter   Mean percentage of letter occurrence
 1             A        15.741%
 2             I        12.849%
 3             T         9.731%
 4             R         9.411%
 5             E         8.981%
 6             O         5.537%
 7             N         5.053%
 8             U         4.354%
 9             S         4.352%
10             C         4.249%
11             L         4.248%
12             M         4.010%
13             P         3.689%
14             D         1.723%
15             B         1.344%
16             G         1.290%
17             F         0.860%
18             V         0.806%
19             Z         0.752%
20             H         0.537%
21             X         0.430%
22             J         0.053%
23             K         0.000%

Vowels mean percentage: 47.462%
Consonants mean percentage: 52.538%


It is easy to see that 49.035% of the words are formed of only 1, 2 or
3 letters; of course, many of them are incomplete words.

* 



The study of 50 grids resulted in: 

Occurrence frequency of words in a grid (see next page). 

It is noticed that the vowel percentage in the grid (47.462%) exceeds the vowel percentage
in the language (42.7%).

So, we can state the following generalization:

Statistical proposition (1): In a grid, the number of vowels tends to be about
47.5% of the total number of letters.

Here is some evidence: a word with n syllables has at least n vowels (in
Romanian there is no syllable without a vowel; see [2]).

The vowel percentage in Romanian is 42.7%; because a grid has to form
words both across and down, the number of vowels will increase. Also, the last two lines and
columns hold the endings of other words in the grid; thus they will usually have more vowels.
When the number of black points decreases, the number of vowels increases (to make
crossing easier, one needs either more black points or more vowels). (A vowel has a higher
probability of entering a word than a consonant.)

Especially in “record grids” (see [3], pp. 33-48), the alternation of vowels and
consonants is noticeable. Another criterion for estimating the value of a grid is a larger
deviation from this “statistical law” (the exception confirms the rule!): i.e. the smaller the
vowel percentage in a grid, the bigger its value.

Statistical proposition (2): Generally, the number of horizontal words equals the
number of vertical words.

Here is the evidence: 100 classical grids from [4] were analyzed experimentally,
giving a percentage of 49.932% horizontal words. Usually, classical grids
are square ( n = m ), so the difference between the numbers of horizontal and vertical
words is (see Proposition 2):

n - m + pNBO - pNBV = pNBO - pNBV .

The difference between the numbers of black points in zone BO and in zone BV
cannot be too big (±1, ±2 and rarely ±3). (Usually, there are not many black points in
zone B, because they are not economical in crossing; see the proof of Proposition 1.)

Taking from [1] the following letter frequency order in the language:

 1. E     5. N     9. L    13. P    17. G    21. J
 2. I     6. T    10. S    14. M    18. F    22. X
 3. A     7. U    11. O    15. B    19. Z    23. K
 4. R     8. C    12. D    16. V    20. H

(because in the grid Ă, Â, Î, Ş, Ţ are replaced by A, A, I, S, T, respectively, they were
removed from the above order), the ecart of the 150 grids becomes

$$\alpha(g) = \frac{1}{23} \sum_{i=1}^{23} |\alpha(A_i)| \approx 1.391;$$

the entropy is

$$H = -\frac{1}{\log_{10} 2} \sum_{i=1}^{23} p_i \log_{10} p_i \approx 3.865$$

and the informational energy (after O. Onicescu) is

$$E(g) = \sum_{i=1}^{23} p_i^2 \approx 0.084.$$
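The three quantities reported for the grids (ecart, entropy and informational energy) can be recomputed from the letter order in the language and the mean letter percentages of the grid table. The sketch below is a minimal illustration; the variable and function names are assumptions made here, and the entropy is computed directly with base-2 logarithms, which is the same as dividing base-10 logarithms by log10(2).

```python
from math import log2

# Rank order of the 23 letters in the language, as quoted from [1].
LANGUAGE_ORDER = "EIARNTUCLSODPMBVGFZHJXK"

# Mean percentage of occurrence of each letter in the grids,
# in decreasing order (values copied from the grid letter table).
GRID_PERCENT = {
    "A": 15.741, "I": 12.849, "T": 9.731, "R": 9.411, "E": 8.981,
    "O": 5.537, "N": 5.053, "U": 4.354, "S": 4.352, "C": 4.249,
    "L": 4.248, "M": 4.010, "P": 3.689, "D": 1.723, "B": 1.344,
    "G": 1.290, "F": 0.860, "V": 0.806, "Z": 0.752, "H": 0.537,
    "X": 0.430, "J": 0.053, "K": 0.000,
}


def text_ecart(reference_order, text_order):
    # Mean absolute difference between the rank in the language and the rank in the text.
    ref_rank = {c: r for r, c in enumerate(reference_order, start=1)}
    total = sum(abs(ref_rank[c] - r) for r, c in enumerate(text_order, start=1))
    return total / len(text_order)


def entropy_bits(probs):
    # Shannon entropy in bits; terms with p = 0 contribute nothing.
    return -sum(p * log2(p) for p in probs if p > 0)


def informational_energy(probs):
    # Onicescu's informational energy: the sum of squared probabilities.
    return sum(p * p for p in probs)


grid_order = "".join(GRID_PERCENT)            # dicts keep insertion order
probs = [v / 100 for v in GRID_PERCENT.values()]

ecart = text_ecart(LANGUAGE_ORDER, grid_order)
print(round(ecart, 3))                        # 1.391 (= 32/23)
print(entropy_bits(probs))                    # close to the 3.865 given in the text
print(informational_energy(probs))            # close to the 0.084 given in the text
```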

Examining 50 grids we obtain:

Words frequency in a grid with respect to the syllables number

Number of syllables   Mean percentage of occurrence of a word in a grid
1                     35.588%
2                     26.920%
3                     21.765%
4                      9.551%
5                      5.294%
6                      0.882%
7                      0.000%
8                      0.000%

Mean length of a word in syllables: 2.246

(In the category of one-syllable words, the words of one, two, or three letters without
any meaning (rare words) were also counted.) One can see that the percentage of words
consisting of one and two syllables is 62.508% (quite high).
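The summary figures of this table follow from its percentages by simple arithmetic. As a small sketch (the dictionary and variable names are assumptions for illustration), the mean length in syllables is the percentage-weighted average of the syllable counts, and the share of short words is the sum of the first two entries:

```python
# Mean percentage of words by number of syllables (values from the table).
SYLLABLE_PERCENT = {
    1: 35.588, 2: 26.920, 3: 21.765, 4: 9.551,
    5: 5.294, 6: 0.882, 7: 0.000, 8: 0.000,
}

total = sum(SYLLABLE_PERCENT.values())

# Weighted average of the syllable counts.
mean_syllables = sum(k * p for k, p in SYLLABLE_PERCENT.items()) / total

# Combined share of one- and two-syllable words.
short_share = SYLLABLE_PERCENT[1] + SYLLABLE_PERCENT[2]

print(round(total, 3))
print(round(mean_syllables, 3))
print(round(short_share, 3))
```

The computed mean agrees with the tabulated mean length up to rounding in the last digit.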

Another statistic (on 50 grids), concerning the predominant parts of speech in a
grid, established the following first three places:

1. nouns 45.441% 

2. verbs 6.029% 

3. adjectives 2.352% 

Notice the large number of nouns. 



SECTION II. REBUS CLUES 



§1. STATISTICAL RESEARCHES ON REBUS CLUES 

Studying the clues of 100 “clues grids”, the following statistical data resulted: 
Rebus clues frequency according to their length (words number ) 

(see the next page) 

It is noticed that the predominant clues are formed of 2, 3, or 4 words. For results 
obtained by investigating 100 “clues grids”, see the next page. 

It is worth mentioning that the vowel percentage in rebus clues (46.679%)
exceeds the vowel percentage in the language (42.7%).

Calculating the ecart of the clues (in accordance with the previous formula) gives

$$\alpha(d) = \frac{1}{27} \sum_{i=1}^{27} |\alpha(A_i)| \approx 1.185$$

(the sound frequency given by Solomon Marcus in [1] was used here); the entropy (Shannon)
is

$$H = -\frac{1}{\log_{10} 2} \sum_{i=1}^{27} p_i \log_{10} p_i \approx 4.226$$

and the informational energy (O. Onicescu) is

$$E(d) = \sum_{i=1}^{27} p_i^2 \approx 0.062.$$

(The calculations were done by means of a pocket calculator.)

Letters occurrence frequency in the rebus clues

Letter order   Letter   Mean percentage of letter occurrence in clues
 1             E        10.996%
 2             I         9.778%
 3             A         9.266%
 4             R         7.818%
 5             U         6.267%
 6             N         6.067%
 7             T         5.611%
 8             C         5.374%
 9             L         4.920%
10             O         4.579%
11             P         4.027%
12             Ă         3.992%
13             S         3.831%
14             M         3.309%
15             D         3.079%
16             Â         1.801%
17             V         1.527%
18             F         1.449%
19             Ş         1.360%
20             Ţ         1.338%
21             G         1.330%
22             B         1.238%
23             H         0.532%
24             J         0.358%
25             Z         0.092%
26             X         0.037%
27             K         0.024%

Vowels percentage: 46.679%
Consonants mean percentage: 53.321%
Letters no. (mean) necessary to clue a grid: 657.342
Mean length of a word (in letters) used in clues: 4.374